SG-WRAM: Schema Guided Wrapper Maintenance
Authors
Abstract
The World Wide Web has become one of the most important gateways to a wide variety of information sources. A large proportion of its data is embedded in HTML documents. HTML serves the visual presentation of data in Internet browsers but does not provide semantic information about the data presented. This form of presentation is therefore ill-suited to automated, computer-assisted information management systems. In particular, if data from different sources needs to be combined, special and often complex programs must be developed to automate the data extraction. Wrappers are specialized program routines that fulfil such tasks: they automatically extract data from web sites and convert the information into a structured format. Because coding wrappers manually is a time-consuming and error-prone process, different methods [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12] have been proposed to automate wrapper generation. As a rule, however, a specially developed wrapper is required for each individual data source, because each web site has its own structure. The WWW is also extremely dynamic and continually evolving, which results in frequent changes to the structure of web documents. Consequently, a wrapper may stop working when the structure of the corresponding documents changes, no matter how the wrapper was generated. Existing wrappers often need to be updated constantly, or even rewritten completely, to maintain the desired data extraction capabilities. The simplest way to maintain wrappers is to re-create them from the new HTML documents. This method is inefficient, however, because the maintenance depends mostly on the system developers. In this demo, we propose a novel schema-guided approach to wrapper maintenance, called SG-WRAM, which is based on our previous work, a schema-guided wrapper generation system (SG-WRAP [8, 9]).
SG-WRAP generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a user-defined schema. Although HTML documents can change in many different ways, some features of the desired information in the previous document, e.g. syntactic features, data patterns, notation and the underlying schema, are still preserved in the changed one. Syntactic features, data patterns and notation can be easily obtained from the schema, the previous rules and the extraction results. It is therefore feasible to recognize data items in the changed document using these features. Based on these observations, we carry out maintenance in four sequential steps. First, syntactic features, data patterns and notation are obtained from the schema, the previous rule and the extracted results. Second, these features are used to recognize the data items in the changed document. Third, the recognized items are grouped according to the given schema, so that each group is an instance of the schema. Finally, representative instances are selected to re-induce the extraction rule. We call these four steps features discovery, item recovery, block configuration and wrapper reparation, respectively. Our schema-guided method for wrapper maintenance has several unique features compared with related work. First, we make good use of the schema, which the user supplies during wrapper generation, to assist item recovery and block configuration. Second, our experience with real-life web documents shows that the method can deal with changes ranging from simple to complex, including context shift, structural shift [12] and hybrid changes. Third, our system applies different methods to simple changes, in which only part of the rule is disabled, and to complex changes, in which most of the rule is disabled. This makes the re-induced rule more accurate and complete.
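The four steps described above can be illustrated with a small Python sketch. It is not the SG-WRAM implementation: the character-class `signature`, the tag-path "rule", the sample data and every function name are assumptions made for illustration, and the real system draws on richer features from the schema and the previous rule.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical inputs: results extracted before the page changed, and the changed page.
PREVIOUS_RESULTS = {"title": ["Dune", "Emma"], "price": ["$9.99", "$12.50"]}
CHANGED_HTML = ("<div><h1>Books</h1><ul><li><b>Dune</b><i>$9.99</i></li>"
                "<li><b>Ulysses</b><i>$14.00</i></li></ul></div>")

def signature(text):
    """Assumed data pattern: letters/spaces -> A, digits -> 9, runs collapsed."""
    out = []
    for ch in text:
        cls = "A" if ch.isalpha() or ch.isspace() else "9" if ch.isdigit() else ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)

class TextNodes(HTMLParser):
    """Collects (tag_path, text) for every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass
    def handle_data(self, data):
        if data.strip():
            self.nodes.append(("/".join(self.stack), data.strip()))

def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes

def discover_features(previous_results):
    """Step 1, features discovery: one set of patterns per schema field."""
    return {f: {signature(v) for v in vals} for f, vals in previous_results.items()}

def recover_items(html, features):
    """Step 2, item recovery: label text nodes whose pattern matches a field."""
    items = []
    for path, text in text_nodes(html):
        for field, patterns in features.items():
            if signature(text) in patterns:
                items.append((field, text, path))
                break
    return items

def configure_blocks(items):
    """Step 3, block configuration: start a new instance when a field repeats."""
    blocks, current = [], {}
    for field, text, path in items:
        if field in current:
            blocks.append(current)
            current = {}
        current[field] = (text, path)
    if current:
        blocks.append(current)
    return blocks

def repair_wrapper(blocks, fields):
    """Step 4, wrapper reparation: re-induce a tag-path rule from complete instances."""
    complete = [b for b in blocks if all(f in b for f in fields)]
    if not complete:
        raise ValueError("no complete schema instance recovered")
    return {f: Counter(b[f][1] for b in complete).most_common(1)[0][0] for f in fields}

def extract(html, rule):
    """Apply the repaired rule to the changed page."""
    by_path = {path: field for field, path in rule.items()}
    items = [(by_path[p], t, p) for p, t in text_nodes(html) if p in by_path]
    return [{f: v[0] for f, v in b.items()} for b in configure_blocks(items)]

if __name__ == "__main__":
    feats = discover_features(PREVIOUS_RESULTS)
    blocks = configure_blocks(recover_items(CHANGED_HTML, feats))
    rule = repair_wrapper(blocks, list(PREVIOUS_RESULTS))
    print(rule)
    print(extract(CHANGED_HTML, rule))
```

In this toy page the heading "Books" matches the title pattern and is recovered as an item, but it forms an incomplete block, so selecting only complete instances keeps it out of the re-induced rule.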
Similar resources
SG-WRAP: A Schema-Guided Wrapper Generator
With the development of the Internet, the World-Wide Web has become everyone’s invaluable information source. However, most of the data on the Web is currently in the form of HTML pages, which are neither well-structured nor associated with a schema. It is almost impossible to use such data efficiently. Web wrapper technology has been developed to transform unstructured/semi-structured data to semi-st...
A Supervised Visual Wrapper Generator for Web-Data Extraction
Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper, we propose a novel schema-guided approach to wrapper generation. We provide a user-friendly interface that allows users to define the schema of the data to be extracted and to specify mappings from an HTML page to the target schema. Based on...
Wrapper Maintenance
A Web wrapper is a software application that extracts information from a semi-structured source and converts it to a structured format. While semi-structured sources, such as Web pages, contain no explicitly specified schema, they do have an implicit grammar that can be used to identify relevant information in the document. A wrapper learning system analyzes page layout to generate either gramm...
An Effective Wrapper Architecture to Heterogeneous Data Source
In this paper, we focus on the problem, in information integration systems, of obtaining data from heterogeneous data sources accurately and effectively. XML is used as the data exchange format of the wrapper. We design the wrapper architecture based on the conversion and management of views as the bridge from the global schema to the local schemas of various data sources. Our wrapper has two main subsystem...
Super-Fast XML Wrapper Generation in DB2: A Demonstration
The XML Wrapper is a new feature of the federated database capabilities of DB2/UDB v8. It enables users and applications to issue SQL queries against XML data from a variety of sources, including files and web services. The XML Wrapper assumes hierarchical XML documents modeled as families of virtual relational tables in a federated schema, which can then be queried to extract information from ...